Readme

require(tidyverse)
require(DiagrammeR)
require(poppr)
require(genepop)
require(graph4lg)
require(related)
require(adegenet)
require(knitr)
require(magrittr)

This document is an R notebook. If you’d like to view pre-rendered figures, read a summary of the analysis, and interact with the code, please open the relevant html file in a browser.

To conduct similar analyses on your own computer, or to edit and run the code, clone this repository into a directory on your local machine and open the .Rproj file in RStudio.

Rationale

Genotyping-in-Thousands by sequencing (GT-seq) is a cost-effective method developed by Campbell et al. (2015) to genotype up to thousands of samples at a small (~500) set of markers using the Illumina platform.

This document:
(a) outlines the bioinformatic procedures for processing raw GT-seq reads to a fully filtered SNP dataset according to the standards adopted by the State Fisheries Genomics Lab at Oregon State University
(b) shares instructions for setting up the environment to run the scripts
(c) provides a detailed example protocol for genotyping Oncorhynchus mykiss samples, with example data, figures, and results
(d) directs GT-seq users to relevant metadata and scripts needed to conduct their own genotyping (e.g. probe sequences)

GTseq outline

grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']
      tab5 [label = '@@5']
      tab6 [label = '@@6']
      tab7 [label = '@@7']
      tab8 [label = '@@8']
      # edge definitions with the node IDs
      tab1 -> tab2 [label = 'GTseq_BarcodeSplit']
      tab2 -> tab3 [label = 'GTseq_Genotyper']
      tab3 -> tab4 [label = 'GTseq_genocompile']
      tab4 -> tab5 [label = 'filtering']
      tab4 -> tab6 [label = 'GTseq_summaryfigs']
      tab3 -> tab7 [label = 'GTseq_genocompilecounts']
      tab7 -> tab5 
      tab1 -> tab8 [label = 'GTseq_seqtest']
      tab7 -> tab4 [label = 'GTseq_Omysex']
      tab6 -> tab5
      }
      [1]: 'Raw Reads'
      [2]: 'Demultiplexed fastqs'
      [3]: 'Individual genotypes'
      [4]: 'raw GTseq dataset'
      [5]: 'final filtered dataset'
      [6]: 'summary figures'
      [7]: 'read counts'
      [8]: 'marker info'
      
      ")

Full GT-seq genotyping flowchart: nodes are data, edges are scripts

grViz("digraph flowchart {
      # node definitions with substituted label text
      node [fontname = Helvetica, shape = rectangle]        
      tab1 [label = '@@1']
      tab2 [label = '@@2']
      tab3 [label = '@@3']
      tab4 [label = '@@4']
      tab5 [label = '@@5']
      # edge definitions with the node IDs
      tab1 -> tab2 [label = 'GTseq_BarcodeSplit']
      tab2 -> tab3 [label = 'GTseq_Genotyper']
      tab3 -> tab4 [label = 'GTseq_genocompile']
      tab4 -> tab5 [label = 'filtering']
      }
      [1]: 'Raw Reads'
      [2]: 'Demultiplexed fastqs'
      [3]: 'Individual genotypes'
      [4]: 'raw GTseq dataset'
      [5]: 'final filtered dataset'
      
      ")

Typical GT-seq genotyping flowchart when using an established panel: nodes are data, edges are scripts

Overview

The GT-seq pipeline consists of a handful of Perl and Python scripts that process raw sequencing reads into a fully filtered SNP dataset. Here’s a quick reference for each step in a typical pipeline.

Sequencing QC
- Check that the reads coming off the sequencer look good with FastQC or a similar program

GT-seq Scripts
These steps are done on a server:
- Demultiplex reads: The first step is to demultiplex the sequencing data, if demultiplexed files are not provided by the sequencing center, using the GTseq_Barcode_Split_MP.py script or a different demultiplexing approach such as deML. Our GT-seq data primarily comes demultiplexed from the sequencing center, so this step is usually skipped.
- Call genotypes: Use GTseq_Genotyper_v3.1.pl to call genotypes on demultiplexed reads.
- (Optional) Call sex genotypes: If the species has a dedicated script for calling sex markers (O. mykiss and O. tshawytscha do), run it as well
- Compile: After all the individual genotypes are called, compile them into a single output using the GTseq_GenoCompile_v3.pl script.

QAQC Check:
- Check positive and negative controls (for plate flipping, other library prep errors)
- Check known genotype controls if included (e.g. winter vs summer run steelhead controls)
- Check technical replicates for concordance
- Remove controls and replicates if all looks good

Filtering:
Filtering operates in multiple stages. The final filtering cutoffs are below:

  • IFI_cutoff = 2.5 (remove individuals with low confidence)
  • GTperc_cutoff=90 (exclude individuals with greater than 10% missing data)
  • Missingness (loci) > 20% (remove loci with >20% missing data)
  • Missingness (loci) > 10% (examine moderately bad loci for incorrect allele correction issues or paralogs and attempt to rescue)
  • Skewed Allele Count Ratios (examine loci for potential paralogs)
  • Remove monomorphic SNPs
  • Remove duplicated individuals
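For orientation, the individual-level cutoffs and the locus missingness cutoff could be sketched with dplyr roughly as follows. This is a minimal illustration only: the object name genos_raw, the IFI and %GT column names, and the assumption of eight leading metadata columns are placeholders to check against your own compiled genotype file (tidyverse is assumed loaded from the first chunk).

```r
# LOCAL R
# sketch, not the SOP's final filtering code; column names are assumptions
genos_filtered <- genos_raw %>%
  filter(IFI < 2.5) %>%   # remove low-confidence individuals
  filter(`%GT` >= 90)     # remove individuals with >10% missing data

# per-locus missingness: proportion of uncalled ("00") genotypes
loci_missing <- colMeans(genos_filtered[, 9:ncol(genos_filtered)] == "00")
genos_filtered <- genos_filtered %>%
  select(-all_of(names(loci_missing)[loci_missing > 0.2]))  # drop loci >20% missing
```

The worked filtering steps later in the SOP apply these cutoffs iteratively rather than in one pass.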

Environment setup and how to use this SOP

This SOP is run in two environments, the OSU cluster and a local instance of R. All code needed to run the SOP can be found here. You can open this document as an .Rmd in RStudio when processing your own data, or simply copy and paste the code found in the code chunks into your own script or terminal.

A few optional code chunks also use the terminal on a Unix-based local machine (e.g. a Mac), but they can also run entirely on the cluster, or on a Windows machine with a little work (just run the Unix commands in a Unix/bash terminal on Windows, such as Cygwin or the bash shell available on Windows 10; install link here).

Commands sent to the terminal appear as code chunks in this SOP, and narrative is provided in the main body. Click the gray “code” box to see the commands and more details about a step. The appropriate environment for running each code chunk appears as a comment at the head of the chunk. There are also full scripts submitted as jobs to the server; these need to be saved as a file and submitted with the appropriate command for the server scheduler, such as qsub or SGE_Batch. Some scripts and commands submitted on the OSU cluster also need to be wrapped into a command for submission (e.g. SGE_Batch) or submitted on an interactive node. Don’t run things on the login node.

Local R Setup

The easiest way to get both the scripts and all the example data you will need to run the example script is to clone the entire GTseq-SOP repository onto your local machine. ALWAYS MAKE SURE YOU CLONE THE MOST RECENT VERSION OF THE REPOSITORY BEFORE RUNNING A NEW ANALYSIS. See the code chunk below for how to do this.

Note also that to run the full GTseq example script using the example data, use the data found at example_data/ in the repository (you don’t need to transfer the data off the server, but it should match if you do). Details are in the relevant sections of the SOP below.

Alternatively, copying the directory in this repository named “example_data” will provide all the relevant files to run the local (R) portion of the example protocol without cloning the full repository.

Make sure you load the required libraries (run the first code chunk in this notebook).

# Local

# note: this is optional and only needed if you want to follow along with the example code on your own machine

#you can also skip the command line version of the git clone process and instead click the clone link under the "code" button on the repository home page.

# move to your working directory 
# create a directory for this repository and move inside
mkdir GT-seq
cd GT-seq

#copy the whole repository using git clone
git clone https://github.com/State-Fisheries-Genomics-Lab/GT-seq.git

# after you've cloned the repository, open the file GT-seq.Rproj in Rstudio for all features

Finally, paths to files and scripts in this SOP point to David Dayan’s directories; you will need to modify the scripts to reflect your own paths.

OSU Cluster Setup

Scripts
To run the GT-seq pipeline, you must have a copy of the most current GT-seq scripts stored on the cluster. As there are multiple versions of the scripts, it is recommended to download them into a directory on the cluster directly from the GitHub repository, which is actively maintained.

Option A: File transfer GUI

Move all the files from the script directory to your own directory on the server using FileZilla or whatever file transfer software you prefer.

Option B: Git

You can also clone the whole repository onto the server with git and then move the relevant files to wherever you want them.

# SERVER

# clone the repository somewhere temporary
mkdir ~/tmp
cd ~/tmp
git clone https://github.com/State-Fisheries-Genomics-Lab/GT-seq


# choose home directory for your software
# make directory for GTseq scripts and move files inside
mkdir /dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/
mv  ~/tmp/GT-seq/GT-seq_scripts/* /dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/

#delete your temporary repo
bash #default shell is tcsh which requires you to confirm EVERY SINGLE file deletion, run bash instead
rm -r ~/tmp/* #caution this is scary, be sure there's no typo here
rmdir ~/tmp

Perl and Python Libraries

The GT-seq pipeline uses both Perl and Python. All of the relevant libraries should already be installed on the server except String::Approx. You will need to install String::Approx and configure your server environment so it is included in your Perl library path.

Installing:


# SERVER

# You have three options to install String::Approx, I confirmed the first one works, but the others may be simpler

#### Option 1: make

# first download the tarball for String Approx here ( https://metacpan.org/release/String-Approx ) into a temporary directory on the server

# then, unpack the tarball and follow the instructions in the readme (copied below)
perl Makefile.PL
make
make test
make install

#### Option 2: Conda
conda install -c bioconda perl-string-approx

#### Option 3: cpan
perl -MCPAN -e shell
install String::Approx
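Whichever route you choose, it may be worth a quick check afterwards that Perl can actually find the module (a small sketch; the fallback message is only illustrative):

```shell
# SERVER

# prints a confirmation if String::Approx is on the Perl library path,
# or a hint to check PERL5LIB otherwise
perl -MString::Approx -e 'print "String::Approx ok\n"' 2>/dev/null \
  || echo "String::Approx not found - check PERL5LIB"
```

If you see the fallback message after installing, set PERL5LIB as shown in the next section and try again.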

Setting the environment:

Before running any scripts that require the String::Approx module, you must set the environment with the following commands. This will be noted in the example scripts below but can also be found here.


#SERVER

export PERL5LIB='/home/fw/dayand/perl5/lib/perl5/x86_64-linux-thread-multi/' # if using bash as your shell (all of the time for the SOP)

setenv PERL5LIB ~/perl5/lib/perl5/x86_64-linux-thread-multi/ #for tcsh (default shell on server, but all scripts that require this run in bash)

Panel Info

This repository contains all the files needed to run the pipeline on new data and additional information about the panel markers. See readmes in each subdirectory for up to date contents.

Probe_Sequences
The probe sequences used by the genotyper script to call SNPs at the panel markers. Link here

Panel Info
Information about the panel markers such as amplicon sequences, annotations, sources, primer and probe sequences etc. Link here

Marker Mapping
Results from mapping studies. Link here

Example Script

Here is the example script of a full genotyping pipeline from raw reads to fully filtered SNP dataset.

Data Summary

The example dataset uses fall run and half-pounder steelhead from the Rogue River. Note that the data on the server can be used to run the full pipeline, but you can also skip the server side analysis (genotype calling) and run just the local side analysis (filtering and data visualization) using the example data provided at the Github repository.

Sample Summary

Let’s gather all the metadata about the physical samples and summarize:

# LOCAL R

meta_data <- read_tsv("example_data/metadata/example_meta_data.txt")

kable(meta_data %>%
  group_by(run, year) %>%
  tally())
run          year     n
fall         2019    54
halfpounder  2019    88

Sequencing Data Summary

The data came already demultiplexed from the sequencing center; its location is given in the code chunk below. There are 153 fastqs, including some controls and replicates.

# SERVER

/dfs/Omalley_Lab/dayan/gtseq_sop/raw_fastqs # 153 fastqs, 

Raw Quality Report for Full Lane:
- Clusters: 409,149,535
- Yield (Mbase): 61,782
- % >Q30 bases: 87.13
- Average Qual: 37.18

Run a FastQC report or otherwise inspect the sequence data (not included in this SOP) to get a look at this kind of information before running your own genotyping.

Demultiplex

The first step is to demultiplex the raw sequencing file using the i5 and i7 indexes. We can usually skip this step in the example script because the sequencing center already performed demultiplexing. Instead, just copy the demuxed files to a working directory.

# SERVER

#example script to copy the data to your own directory

SGE_Batch -c "cp /dfs/Omalley_Lab/dayan/gtseq_sop/raw_fastqs/*fastq /path/to/your/fastq/directory" -q harold -r copy

If, instead, you are supplied with multiplexed fastq files (boo), you can use the GTseq scripts to demultiplex your data. First you need to generate a key file.

# LOCAL R

# example: Sample,PlateID,i7_name,i7_sequence,i5_name,i5_sequence
#          Sample123,P1234,i001,ACCGTA,25,CCCTAA
#          Sample321,P1234,i001,ACCGTA,26,GGCACA

#first lets get the index sequences 
index2020 <- read_tsv("example_data/metadata/index_2020.txt")
colnames(index2020) <- c("Sample","PlateID","i7_name","i7_sequence","i5_name","i5_sequence")
write_csv(index2020, "example_data/metadata/index_2020_lane.csv")

Then you run the script.

# SERVER

#note this is from Sandra's protocol so the directories are different, but still same approach

#Set directory
usedir='/nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/OmyRogue'

#First argument is list of barcodes and second argument is .fastq file.
#Make sure you dos2unix the barcode file if you made it in Excel or Windows!

mkdir $usedir'/sample_fastqs'
cd $usedir'/sample_fastqs'

#here is the command you will submit
python '/nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/pipeline/GTseq_BarcodeSplit_MPargs.py' $usedir'/OmyROGR_index-list.csv' '/nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/RAW_READS/OmyROGROtsFERHCLAR.fq'

#here is how to wrap it correctly for submission to the server queue
SGE_Batch -q harold -r omysex -c "python '/nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/pipeline/GTseq_BarcodeSplit_MPargs.py' $usedir'/OmyROGR_index-list.csv' '/nfs1/FW_HMSC/OMalley_Lab/bohns/GTseq/RAW_READS/OmyROGROtsFERHCLAR.fq'"

Genotype

Main Genotyper
Next we’ll run the GTseq genotyper (v3.1) script on each fastq file to generate the individual genotypes (.genos). Note that there are many copies of the genotyper script; 3.1 is the current version (asymmetrical genotype caller bug fixed).

We need to run this script for each demultiplexed fastq. To speed this up we’ll use an array job. The code chunk below contains an example array job script. To run the job, you’ll save the script on the server and run it using the qsub command.

The genotyper script requires unzipped fastq files for input. The first code chunk below can decompress the input if your demultiplexed files are compressed (example files do not need to be decompressed). The second code chunk contains the genotyper script.

# SERVER

# Decompression script

# note 1: this is a script to save and submit as a job; save everything below the long ########### below

# note 2: the task range (-t option) is hardcoded to match the number of input files; change this range to reflect how many fastq.gz files you have

################################# save everything below this as a file
#!/bin/bash
#$ -S /bin/bash
#$ -t 1-153
#$ -tc 20
#$ -N decompress
#$ -cwd
#$ -o $JOB_NAME_$TASK_ID.out
#$ -e $JOB_NAME_$TASK_ID.err

FASTQS=(`ls *fastq.gz`)
INFILE=${FASTQS[$SGE_TASK_ID -1]}

gunzip -c $INFILE > ${INFILE%.gz}

#save as script and submit this with qsub -q harold scriptname
####################################
# SERVER 

# Genotyper Script

# note 1: this is a script to save and submit as a job; save everything below the long ########### below

# note 2: the task range (-t option) is hardcoded to match the number of input files; change this range to reflect how many fastq files you have in the directory

################################# save everything below this as a file
#!/bin/bash
#$ -S /bin/bash
#$ -t 1-153
#$ -tc 20
#$ -N GTseq-genotyperv3
#$ -cwd
#$ -o $JOB_NAME_$TASK_ID.out
#$ -e $JOB_NAME_$TASK_ID.err

export PERL5LIB='/home/fw/dayand/perl5/lib/perl5/x86_64-linux-thread-multi/' #you may want to change this to your own perl lib destination

FASTQS=(`ls /dfs/Omalley_Lab/dayan/gtseq_sop/raw_fastqs/*fastq`) #reminder to change the directory to your copy of the fastqs
INFILE=${FASTQS[$SGE_TASK_ID -1]}
OUTFILE=$(basename ${INFILE%.fastq}.genos)

GTSEQ_GENO="/dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_Genotyper_v3.1.pl" #again, change this path to your own copy of this script

PROBE_SEQS="/dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/Omy_GTseq390_ProbeSeqs.csv" #change the probe seq file to match whatever panel you are using

perl $GTSEQ_GENO $PROBE_SEQS $INFILE > $OUTFILE

#save this code chunk as a file on the server and submit this with the following command from the directory you want the output .genos files:
# qsub -q harold scriptname 
# note that you might submit to a different -q than harold

Some notes about this script:
- -t: the total range of task IDs; the script will iterate over this list, so if you have 100 fastq files to genotype, set this to 1-100.
- -tc: the throttle; the value used in the script above (20) means the 153 separate commands will be submitted in batches of 20 until done. Check that this many nodes are available in the submit queue (use SGE_Avail).
- Submitting: I usually store all of the scripts I will submit to the server in a directory named “sge”, so to submit this script (named gtseq_geno.txt) to the otter queue, I’d run qsub -q otter path/to/my/scripts/sge/gtseq_geno.txt. You can check progress with the qstat command.
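To avoid hand-counting files when setting the -t range, you can derive it from a count of the fastqs in the target directory. A toy demonstration (demo_fastqs and its contents are throwaway examples; substitute your own fastq directory):

```shell
# toy demonstration: derive the -t range from a count of fastq files
mkdir -p demo_fastqs
touch demo_fastqs/s1.fastq demo_fastqs/s2.fastq demo_fastqs/s3.fastq
N=$(ls demo_fastqs/*.fastq | wc -l | tr -d ' ')
echo "#\$ -t 1-$N"   # prints: #$ -t 1-3
rm -r demo_fastqs
```

Paste the printed directive into the array job header in place of the hardcoded 1-153.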

Sex Genotyper

After the panel genotypes are written, we run the sex genotyper.

# SERVER

# notes

# the omysex script (and otssex) is hardcoded to require the fastqs and genos to all be in a single collective directory, for fastqs to be decompressed, and for it to be run from this directory
# rather than try to fix this hardcoding in the script, a shortcut is to move all the fastqs into the directory with your .genos from the previous step then move them back after to conserve space
# another hardcoded part of the omysex script involves how it parses sample names. all fastqs must be samplename.fastq and all genos must be samplename.genos

##### move decompressed fastqs to directory with your .genos

#from the directory with the .genos files
mv /dfs/Omalley_Lab/dayan/gtseq_sop/raw_fastqs/*fastq ./

##### sex genotyper
#below is the command to run the omysex script; note that it is wrapped into an SGE_Batch command. you can also use an interactive node (e.g. qrsh), but don't run this on the login node

SGE_Batch -q harold -r omysex -c 'perl /dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/OmySEX_test_v3.pl'

##### don't forget to move these files back when done

Compile Genotypes

After all the .genos are written, we compile them into a single output using the GenoCompile script

#SERVER

#this is run from the directory with your .genos files
SGE_Batch -q harold -r compile -c 'perl /dfs/Omalley_Lab/dayan/software/GTseq-Pipeline/GTseq_GenoCompile_v3.pl > example_GTs_0.1.csv' 

Congratulations, you’ve run the pipeline! Now we will need to do quality control and filtering.

At this point in the suggested published protocol, you would run the SummaryFigures script from the GTseq pipeline. However, this script contains hardcoded values for genotype calling that differ from the actual genotyping script and will misrepresent the data. Do not run it; it will mislead you. We will produce similar figures in the next section.

At this point you can move the file “example_GTs_0.1.csv” onto your local machine to conduct filtering.

QAQC

Here we do some basic quality control. We check positive (known good DNA) and negative (known blank) controls, as well as controls of known genotype (i.e. known winter run, etc). After the controls, we check replicate samples for concordance to confirm that no wet-lab errors occurred that might have scrambled indexing/barcoding.

If you just want to run the protocol from this point forward the outputs from the server steps are available in the /example_data directory of this repository.

Marker Info

The first step in the QA-QC process is to collect some information about genotyping success from the .genos files. We’ll do this with an awk one-liner.

The script below pulls the allele count ratios and read counts for all individuals in the pipeline.

# SERVER

#run from the directory with your .genos (use an SGE_Batch job or an interactive shell)

#collect marker info from all the genos files
bash
touch marker_info.csv
echo 'ind,marker,a1_count,a2_count,called_geno,a1_corr,a2_corr' >> marker_info.csv
for file in ./*genos
do
    awk -F"," ' BEGIN { OFS="," } {print FILENAME,$1,$2,$3,$6,$7,$8}' $file >> marker_info.csv
done

# now we'll cleanup this file so that it is easier to work with
sed -i '/Raw-Reads/d' ./marker_info.csv #first get rid of genos headers
sed -i '/negative/d' ./marker_info.csv # then get rid of controls
sed -i '/positive/d' ./marker_info.csv # then get rid of controls

Read in the marker info file and clean it up a little more. Note you’ll have to transfer the file off the server for this.

#LOCAL R

marker_info <- read_csv("example_data/genotype_data/marker_info.csv")

#this part changes the values of A=2, G=898, -=52, etc for the allele count columns to the actual values, and gets rid of some mess in the sample names (ind)
marker_info %<>%
  mutate(a1_count =  as.numeric(substr(a1_count, 3, nchar(a1_count)))) %>%
  mutate(a2_count =  as.numeric(substr(a2_count, 3, nchar(a2_count)))) %>%
  mutate(ind = str_remove(ind, "^\\./")) %>%
  mutate(ind = str_remove(ind, "\\.genos"))

Controls

First let’s check that the controls worked well. We will verify that negative controls have far fewer reads than average (there may be some on-target reads from other samples due to index sequence error).

# LOCAL R

# First we are going to prep the raw genotype data for filtering.

# read the raw genotypes file in to R
genos_0.1 <- read_csv("example_data/genotype_data/example_GTs_0.1.csv")

# add a field to mark controls
# here controls contained "positive" or "negative" in their sample names, so we used simple pattern matching to create a new column; you can add your own patterns for known controls (e.g. known winter run steelhead)
genos_0.1 %<>%
  mutate(control = case_when(str_detect(Sample, "positive") ~ "positive",
                             str_detect(Sample, "negative") ~ "negative",
                             TRUE ~ "sample"))

# clean up sample name field
# split the sample name and the adapter sequence (note that replicates will have the same sample name, but we'll keep track with the adapter sequences)

genos_0.1 %<>%
  mutate(adapter = str_extract(Sample, "[ATCG]{6}-[ATCG]{6}")) %>%
  mutate(sample_simple = str_extract(Sample, "[:upper:][:lower:]{2}[AJCU][RC]\\d{2}\\w{4}_\\d{4}")) %>%
  relocate(Sample, sample_simple, adapter)

# great, prep is done, now lets make our first plot: distribution of reads between controls and samples
ggplot()+geom_histogram(data = genos_0.1, aes(x = `On-Target Reads`, fill= control)) + theme_classic()+scale_fill_viridis_d()

Looks good (somewhat). Negative controls have very few reads, and positives are mostly in the center of the read distribution (some have zero, but they just went bad…). Let’s also double-check that there isn’t a negative control with a lot of reads hiding in there (possible when there are thousands of samples), which would indicate a plate flip:

#LOCAL R

ggplot()+geom_histogram(data = genos_0.1[genos_0.1$control=="negative",], aes(x = `On-Target Reads`)) + theme_classic()

Looks great, but let’s also examine reads as a proportion of the total; sometimes a negative can have a lot of reads from contamination, but they should all be off-target.

#LOCAL R
ggplot()+geom_histogram(data = genos_0.1, aes(x = `%On-Target`, fill= control)) + theme_classic()+scale_fill_viridis_d()

Looks good (except for those aging positive controls)

Replicates

Some samples were replicated; let’s check the genotypes for concordance, then pick the replicate with better GT success and throw out the duplicate.

#LOCAL R

# here we filter out our known controls and create our next dataset genos_0.11
genos_0.11 <- genos_0.1 %>%
  filter(control == "sample") %>%
  select(-control) #get rid of this column

# sometimes we'll use more than a single replicate, which broke a previous version of this code chunk; for now we'll find the triplicate (or more) samples and keep the two with the greatest on-target read depth (ignoring triplicates, quadruplicates, etc)
 
genos_0.11 %<>% 
  group_by(sample_simple) %>%
  slice_max(order_by = `On-Target Reads`, n = 2)

#now let's get duplicated samples
dups <- genos_0.11[genos_0.11$sample_simple %in% genos_0.11$sample_simple[duplicated(genos_0.11$sample_simple)],]
dups <- dups[order(dups$sample_simple),]

# next we'll calculate the percent concordance among replicates
# woof I don't see a good way around using a nested for loop here, maybe fix this in the future

dups_genos <- dups[,9:ncol(dups)] 
rep_info <- matrix(ncol=ncol(dups_genos), nrow=nrow(dups_genos)/2)
colnames(rep_info) <- colnames(dups_genos)
for (j in 1:(nrow(dups_genos)/2)) {
for (i in 1:ncol(dups_genos)) {
  rep_info[j,i] <- sum(dups_genos[(j*2)-1,i]==dups_genos[(j*2),i])
}
  }

geno_concordance <- as.data.frame(as.matrix(rep_info)) %>%
  rowMeans()

rep_data <- as.data.frame(cbind(dups[c(1:length(geno_concordance))*2,1], geno_concordance))
ggplot(data=rep_data)+geom_histogram(aes(x=geno_concordance))+theme_classic()

There are ~4 replicated samples in the example data. Mean concordance is high, but there’s some variance to explore. Next let’s examine what’s going on with the pair of samples with lower concordance.

#LOCAL R

#get the bad samples
bad_reps <- genos_0.11[genos_0.11$Sample %in% rep_data[rep_data$geno_concordance<0.7,1],1:8]
bad_reps[order(bad_reps$Sample),]

One of the pair simply has extremely low %On-Target reads.

Replication Summary
Replicate samples look good; the bad replicate pair (<50% concordance) appears to be due to one replicate having extremely low GT success.

Next let’s make the 0.2 dataset (i.e. remove the replicates with lower GT success).

# LOCAL R

#this writes a new dataset (0.2) by choosing the samples within duplicates and keeping the one with the highest genotyping success
genos_0.2 <- genos_0.11 %>%
  group_by(sample_simple) %>%
  filter(`On-Target Reads` == max(`On-Target Reads`))

Sex Ratios

If you are using the OmySEX or OtsSEX scripts to call sex genotypes, read this section. Otherwise, skip it and move on to filtering.

The OmySEX and OtsSEX scripts rely on hardcoded estimates of the proportion of reads dedicated to the sex marker to call sex genotypes. This can sometimes go awry if the sex marker does not amplify as expected during the library prep (e.g. primers are aging, proportion of primers in the primer pool is off, etc).

In this portion of the SOP we will check if the sex genotyping script worked well, and if not, apply a correction.

Sex Script Check

First, let’s check if the script worked correctly.

#LOCAL R

# Plot sex marker counts

sex_marker_info <- marker_info %>%
  filter(str_detect(marker, "SEXY")) %>%
  mutate(called_geno = replace_na(called_geno, "00")) %>%
  mutate(called_geno = case_when(called_geno == "A1HOM" ~ "XX",
                                 called_geno == "HET" ~ "XY",
                                 called_geno == "00" ~ "00"))
  
ggplot(data = sex_marker_info)+
  geom_point(aes(a2_count, a1_count, color = called_geno))+
  scale_colour_viridis_d(name = "Called Sex Genotype")+
  theme_classic()+
  xlab("Y-specific probe count")+
  ylab("Theoretical X chromosome count")+
  geom_abline(aes(intercept = 0, slope = 0.1))+
  geom_abline(aes(intercept = 0, slope = 0.2))+
  geom_abline(aes(intercept = 0, slope = 5))+
  geom_abline(aes(intercept = 0, slope = 10))+
  geom_abline(aes(intercept = 0, slope = 1), color = "darkred")+
  xlim(0, max(c(sex_marker_info$a1_count, sex_marker_info$a2_count)))+
  ylim(0, max(c(sex_marker_info$a1_count, sex_marker_info$a2_count)))

kable(sex_marker_info %>% count(called_geno), caption = "Called Sex Ratio")
Called Sex Ratio

called_geno    n
00            23
XX            92
XY            31

In the example data presented above (if you’re running this script on your own data this will look different so disregard), we can see that there are two very clear clusters of data: one with very few Y-counts (green cluster) and a second that includes both uncalled (00) and male genotypes (purple and yellow cluster). This second cluster should follow a 1:1 ratio between Y-specific probe count and Theoretical X chromosome count (i.e. the dark red line in the plot), but we see here that it does not. Something has gone wrong. There are too many Theoretical X chromosome counts. This could be caused by the Y-specific primers in the pool going bad, or too small of an amount mixed into the primer pool. In any case it looks like males are being undercalled.

If your data looks good (clear clusters, males follow the 1:1 line, and very few uncalled sex genotypes), then move on to filtering. If your data looks bad, let’s do some correction.

Sex Genotype Correction

Here, instead of relying on a hardcoded value for the proportion of reads dedicated to the sex marker, we will estimate this value from the empirical data and use the estimate to re-call sex genotypes.

#LOCAL R

# First let's collect the number of on-target reads for each sample in the sex_marker_info dataframe
sex_marker_info$sample <- str_extract(sex_marker_info$ind, "[:upper:][:lower:]{2}[AJCU][RC]\\d{2}\\w{4}_\\d{4}")
sex_marker_info$adapter <- str_extract(sex_marker_info$ind, "[ATCG]{6}-[ATCG]{6}") 

sex_marker_info %<>%
  left_join(dplyr::select(genos_0.1, sample_simple, adapter, `On-Target Reads`), by = c("sample" = "sample_simple" , "adapter" = "adapter"))

# now here's the tricky part. we're going to do a first pass filter to eliminate any samples that couldn't possibly be males, then estimate the theoretical number of X chromosome counts for possible males
#first get the estimate
Xmale_proportion_error <- sex_marker_info %>%
  filter(a2_count > 1) %>% #this value is set to 1 if the actual count is zero in the omysex script, so here we're eliminating all individuals where it would be extremely unlikely that they are males
  summarise(Xmale_proportion = mean(a2_count/`On-Target Reads`))

#then use the estimate to correct the data
sex_marker_info %<>%
  mutate(XY_count_estimate = Xmale_proportion_error[[1]]*`On-Target Reads`) %>%
  mutate(corrected_ratio = XY_count_estimate/a2_count) %>%
  mutate(corrected_sex_genotype = case_when(corrected_ratio > 10 ~ "XX",
                                            corrected_ratio <= 0.1 ~ "XY",
                                            corrected_ratio <= 0.2 ~ "00",
                                            corrected_ratio <= 5 ~ "XY",
                                            TRUE ~ "00"
                                            )) #note these ratios are taken directly from the omysex script but this seems like another bug, shouldn't XY < 0.2 be NA not XY???

Now we’ll plot again and see if this correction worked as expected.

#LOCAL R

ggplot(data = sex_marker_info)+
  geom_point(aes(a2_count, XY_count_estimate, color = corrected_sex_genotype), alpha = 0.7)+
  scale_colour_viridis_d(name = "Corrected Called\nSex Genotype")+
  theme_classic()+
  xlab("Y-specific probe count")+
  ylab("Corrected Theoretical X chromosome count")+
  geom_abline(aes(intercept = 0, slope = 0.1))+
  geom_abline(aes(intercept = 0, slope = 0.2))+
  geom_abline(aes(intercept = 0, slope = 5))+
  geom_abline(aes(intercept = 0, slope = 10))+
  geom_abline(aes(intercept = 0, slope = 1), color = "darkred")+
  xlim(0, max(c(sex_marker_info$XY_count_estimate, sex_marker_info$a2_count)))+
  ylim(0, max(c(sex_marker_info$XY_count_estimate, sex_marker_info$a2_count)))

kable(sex_marker_info %>% 
        group_by(corrected_sex_genotype) %>%
        summarise(count = n(), mean_depth = mean(`On-Target Reads`)), caption = "Corrected Called Sex Ratio")
Corrected Called Sex Ratio

corrected_sex_genotype   count   mean_depth
00                           3     36112.33
XX                          85    143630.35
XY                          58    140759.38

In the example data, this correction appears to have worked: called males fall along the 1:1 line, and there are very few uncalled sex genotypes apart from samples with very low total on-target reads. The overall sex ratio of this sample is also closer to our a priori expectations.

Finally let’s correct the sex genotypes in the data.

#LOCAL R

# run this command for Omy
genos_0.2 %<>%
  left_join(select(sex_marker_info, sample, adapter, corrected_sex_genotype), by = c("sample_simple"="sample", "adapter"="adapter")) %>%
  mutate(OmyY1_2SEXY = corrected_sex_genotype) %>%
  select(-corrected_sex_genotype)
#LOCAL R

# run this command for Ots
genos_0.2 %<>%
  left_join(select(sex_marker_info, sample, adapter, corrected_sex_genotype), by = c("sample_simple"="sample", "adapter"="adapter")) %>%
  mutate(`Ots_SEXY3-1` = corrected_sex_genotype) %>%
  select(-corrected_sex_genotype)

Filtering

Control and replicates have been removed, now it’s time for filtering.

Filtering Summary
We take an iterative approach to filtering:

First remove the worst individuals and genotypes:

  • GTperc_cutoff=70 (individuals with more than 30% missing data excluded)
  • Missingness (loci) > 50% (loci with total missing data > 50% removed)
  • IFI_cutoff=10 (i.e. >10% background reads)

Then recalculate missingness and IFI:

  • IFI_cutoff=2.5
  • GTperc_cutoff=90 (individuals with more than 10% missing data excluded)
  • Missingness (loci) > 20%

Then examine for paralogs among markers with:

  • Missingness (loci) > 10% (examine for allele correction issues)
  • Heterozygotes and “in-betweeners” that do not follow a 1:1 ratio of allele counts
  • High variance in the ratio of allele counts at heterozygotes and “in-betweeners”

Finally, remove monomorphic SNPs.
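
For convenience, the thresholds actually applied in the code below can be collected in one place (the list and names here are our own sketch, not part of the SFGL scripts):

```r
# thresholds used in the two filtering passes below
cutoffs <- list(
  pass1 = list(min_GT_pct = 70, max_locus_missingness = 0.50, max_IFI = 10),
  pass2 = list(min_GT_pct = 90, max_locus_missingness = 0.20, max_IFI = 2.5)
)

cutoffs$pass2$max_IFI  # 2.5
```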

IFI and Missingness

First we filter individuals and loci on IFI, and missingness.

Let’s take a look at the distribution of these values before any filtering
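
As a reminder of what IFI measures, here is the calculation on toy, invented counts (IFI is background reads expressed as a percentage of homozygote reads):

```r
# toy numbers, hypothetical: a clean sample with little contamination
toy_back_count <- 5      # off-allele reads at homozygous calls
toy_hom_ct     <- 1000   # reads supporting the called homozygous alleles
toy_ifi <- (toy_back_count / toy_hom_ct) * 100  # 0.5, well below the 2.5 cutoff
```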

#LOCAL R

ggplot(genos_0.2)+geom_histogram(aes(x=IFI))+geom_vline(aes(xintercept= 2.5), color="red")+theme_classic()

ggplot(genos_0.2)+geom_histogram(aes(x=`%GT`))+geom_vline(aes(xintercept= 90), color="red")+theme_classic()

missingness <- colSums(genos_0.2[, 9:ncol(genos_0.2)] == "00" | genos_0.2[, 9:ncol(genos_0.2)] == "0") / nrow(genos_0.2) #warning, hardcoding: the column range 9:ncol() assumes the first 8 columns hold metadata ("sample name", "IFI", "on-target reads", etc.), as in the example Omy panel with 390 markers. Adjust these indices to match the genotype columns of the genos object that YOU are running.
missing <- as.data.frame(missingness)
missing$marker <- row.names(missing)
ggplot(missing) + geom_histogram(aes(x=missingness))+geom_vline(aes(xintercept= 0.2), color="red")+geom_vline(aes(xintercept= 0.1), color="blue")+theme_classic()+xlab("missingness (loci)")
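
To make the missingness calculation concrete, a toy example with invented genotypes (missingness is the per-marker proportion of "00" diploid no-calls or "0" haploid no-calls):

```r
# two markers, four individuals; m1 has two missing calls, m2 has one
toy <- data.frame(m1 = c("AA", "00", "AG", "00"),
                  m2 = c("TT", "TT", "0", "TT"))

colSums(toy == "00" | toy == "0") / nrow(toy)
#   m1   m2
# 0.50 0.25
```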

0.3: Extremely Bad Loci and Individuals Excluded

First remove the individuals and markers that clearly failed to genotype correctly (one step at a time)

#print table of bad missingness individuals
kable(genos_0.2 %>%
  filter(`%GT` < 70) %>%
    select(2,4,5,6,7), caption = "Individuals with high missingness (>30% missing data)")

Individuals with high missingness (>30% missing data)

sample_simple      Raw Reads   On-Target Reads   %On-Target     %GT
OmyAC19ROGR_4914      331506              4818         1.45   38.87
# now remove them
genos_0.3 <- genos_0.2 %>%
  filter(`%GT` > 70)

#now recalculate locus level missingness after removing the worst individuals
  
missingness2 <- (colSums(genos_0.3[,c(9:(ncol(genos_0.3)))] == "00" | genos_0.3[,c(9:(ncol(genos_0.3)))] == "0"))/nrow(genos_0.3) 
missing2 <- as.data.frame(missingness2)
missing2$marker <- row.names(missing2)

#then remove these markers
# collect bad markers
very_bad_markers <- missing2[missing2$missingness2>0.5, 2]
print(paste(length(very_bad_markers), "markers with > 50% missing data"))
## [1] "3 markers with > 50% missing data"
#write the new dataset
genos_0.3 <- genos_0.3 %>%
  dplyr::select(-one_of(very_bad_markers))

#then recalculate IFI
# IFI is equal to the percentage of "background" reads to homozygote reads. Two types of reads contribute to background count: (1) Reads from the alternative allele when an individual has been called as homozygote at a locus, and (2) reads from the less frequent allele when the individual has been called as "in-betweener". We update the IFI score by including only markers in the filtered dataset


IFI <- marker_info %>%
  filter(marker %in% colnames(genos_0.3)) %>%
  group_by(ind) %>%
  summarize(back_count = sum(a1_count[called_geno == "A2HOM"], na.rm = TRUE)
            + sum(a2_count[called_geno == "A1HOM"], na.rm = TRUE)
            + sum(a1_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a2_count > a1_count)], na.rm = TRUE )
            + sum(a2_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a1_count > a2_count)], na.rm = TRUE ),
            
            hom_ct = sum(a1_count[called_geno == "A1HOM"], na.rm = TRUE)
            + sum(a2_count[called_geno == "A2HOM"], na.rm = TRUE)
            + sum(a2_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a2_count > a1_count)], na.rm = TRUE )
            + sum(a1_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a1_count > a2_count)], na.rm = TRUE ),
            
            ifi2 = (back_count/hom_ct)*100)

# the "marker_info" file we produced earlier used the filename of the genos file as the sample name (column name "ind"), but the sample names in our local R dataframes have been cleaned up (see line 504). Here we attempt the same cleanup with regex, based on the standardized SFGL sample-naming codes; note that depending on how your fastq files are named, these exact matches may not work for you
# until we find a better solution, two alternatives if the regex below breaks:
# 1: if the number of high-IFI samples is very low, just write the sample names out manually to a vector and use that to filter
# 2:

IFI$sample <- str_extract(IFI$ind, "[:upper:][:lower:]{2}[AJCU][RC]\\d{2}\\w{4}_\\d{4}")
IFI$adapter <- str_extract(IFI$ind, "[ATCG]{6}-[ATCG]{6}") 


genos_0.3 <- genos_0.3 %>%
  left_join(select(IFI, sample, adapter, ifi2), by = c("sample_simple" = "sample", "adapter" = "adapter")) %>%
  mutate(IFI = ifi2) %>%
  select(-one_of("ifi2"))

# now filter on IFI
#print table of bad IFI samples
kable(genos_0.3 %>%
  filter(IFI >10) %>%
    select(2:7), caption = "Extreme High IFI (>10) samples (low confidence barcodes)")
Extreme High IFI (>10) samples (low confidence barcodes)
sample_simple adapter Raw Reads On-Target Reads %On-Target %GT
#update the  dataset
genos_0.3 <- genos_0.3 %>%
  filter(IFI < 10)
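
As a quick check that the sample-naming regex behaves as intended, here it is applied to a hypothetical genos filename (the filename is invented for illustration; the patterns are the ones used above):

```r
library(stringr)

# hypothetical genos filename following the SFGL naming convention
fname <- "OmyAC19ROGR_4914_ACGTAC-TGCATG.genos"

str_extract(fname, "[:upper:][:lower:]{2}[AJCU][RC]\\d{2}\\w{4}_\\d{4}")
# "OmyAC19ROGR_4914"

str_extract(fname, "[ATCG]{6}-[ATCG]{6}")
# "ACGTAC-TGCATG"
```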

Filtering log 0.2 -> 0.3:
1 sample removed with genotyping success less than 70%
3 loci removed with > 50% missingness
0 samples removed with high IFI

0.4 Second Iteration Filter

Next we do the same process, but at the final filtering levels:

  • IFI_cutoff=2.5
  • GTperc_cutoff=90 (inds greater than 10% missing data excluded)
  • Missingness (loci) > 20%
#print table of bad missingness individuals
kable(genos_0.3 %>%
  filter(`%GT` < 90) %>%
    select(2,4,5,6,7,8), caption = "Individuals with high missingness (>10% missing data)")

Individuals with high missingness (>10% missing data)

sample_simple       Raw Reads   On-Target Reads   %On-Target     %GT         IFI
OmyAC19ROGR_4947       514196             13727         2.67   70.59   1.2348285
OmyAC19ROGR_5725       480551             13611         2.83   79.80   1.4192849
OmyJC19ROGR_0069       183568             22606        12.31   86.70   0.4337986
OmyJC19ROGR_0086       198018             12917         6.52   72.12   1.4722894
# now remove them
genos_0.4 <- genos_0.3 %>%
  filter(`%GT` > 90)

#now recalculate locus level missingness after removing the worst individuals
  
missingness3 <- (colSums(genos_0.4[,c(9:(ncol(genos_0.4)))] == "00" | genos_0.4[,c(9:(ncol(genos_0.4)))] == "0"))/nrow(genos_0.4) 
missing3 <- as.data.frame(missingness3)
missing3$marker <- row.names(missing3)

#then remove these markers
# collect bad markers
bad_markers <- missing3[missing3$missingness3>0.2, 2]
print(paste(length(bad_markers), "markers with > 20% missing data"))
## [1] "10 markers with > 20% missing data"
#write the new dataset
genos_0.4 <- genos_0.4 %>%
  dplyr::select(-one_of(bad_markers))

#then recalculate IFI
# IFI is equal to the percentage of "background" reads to homozygote reads. Two types of reads contribute to background count: (1) Reads from the alternative allele when an individual has been called as homozygote at a locus, and (2) reads from the less frequent allele when the individual has been called as "in-betweener"

IFI <- marker_info %>%
  filter(marker %in% colnames(genos_0.4)) %>%
  group_by(ind) %>%
  summarize(back_count = sum(a1_count[called_geno == "A2HOM"], na.rm = TRUE)
            + sum(a2_count[called_geno == "A1HOM"], na.rm = TRUE)
            + sum(a1_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a2_count > a1_count)], na.rm = TRUE )
            + sum(a2_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a1_count > a2_count)], na.rm = TRUE ),
            
            hom_ct = sum(a1_count[called_geno == "A1HOM"], na.rm = TRUE)
            + sum(a2_count[called_geno == "A2HOM"], na.rm = TRUE)
            + sum(a2_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a2_count > a1_count)], na.rm = TRUE )
            + sum(a1_count[is.na(called_geno) == TRUE & ((a1_count + a2_count)>=10) & (a1_count > a2_count)], na.rm = TRUE ),
            
            ifi2 = (back_count/hom_ct)*100)

# the "marker_info" file we produced earlier used the filename of the genos file as the sample name (column name "ind"), but the sample names in our local R dataframes have been cleaned up (see line 504). Here we attempt the same cleanup with regex, based on the standardized SFGL sample-naming codes; note that depending on how your fastq files are named, these exact matches may not work for you
# until we find a better solution, two alternatives if the regex below breaks:
# 1: if the number of high-IFI samples is very low, just write the sample names out manually to a vector and use that to filter
# 2: 

IFI$sample <- str_extract(IFI$ind, "[:upper:][:lower:]{2}[AJCU][RC]\\d{2}\\w{4}_\\d{4}")
IFI$adapter <- str_extract(IFI$ind, "[ATCG]{6}-[ATCG]{6}") 


genos_0.4 <- genos_0.4 %>%
  left_join(select(IFI, sample, adapter, ifi2), by = c("sample_simple" = "sample", "adapter" = "adapter")) %>%
  mutate(IFI = ifi2) %>%
  select(-one_of("ifi2"))

# now filter on IFI
#print table of bad IFI samples
kable(genos_0.4 %>%
  filter(IFI >2.5) %>%
    select(2:7), caption = "High IFI (>2.5) samples (low confidence barcodes)")
High IFI (>2.5) samples (low confidence barcodes)
sample_simple adapter Raw Reads On-Target Reads %On-Target %GT
#update the  dataset
genos_0.4 <- genos_0.4 %>%
  filter(IFI < 2.5)

Note that the table above is blank in the example script because 0 individuals showed high contamination.

0.3 -> 0.4 Filtering Log

Filtered out:
4 individuals with <90% genotyping success (i.e. greater than 10% missing data)
10 markers with > 20% missingness
0 contaminated samples (note here that all the samples with high IFI are already removed by the individual level missingness step in the example data)

0.5: Removing Paralogs

Now we manually examine allele counts for markers that may tag paralogous regions. Because our panels can contain hundreds of loci, we flag three types of markers for close scrutiny (below), but this is informal and you can also inspect any marker you want using the scripts below.

  • Missingness (loci) > 10% (examine for allele correction issues)
  • Markers where heterozygotes and “in-betweeners” do not follow a 1:1 ratio of allele counts
  • Markers with high variance in the ratio of allele counts at heterozygotes and “in-betweeners”

Let’s collect these markers, first markers with high missingness (10-20% missingness)

# Local R

#get marker names of markers with 0.1 > missingness > 0.2
miss0.1 <- missing3[missing3$missingness3 > 0.1,]
miss_mod <- miss0.1[miss0.1$missingness3 < 0.2, 2]

Next, markers with skewed allele-count ratios and allele ratios with high variance. We do this by fitting a linear model between allele 1 counts and allele 2 counts, then flagging markers with a slope greater than 3/2 or less than 2/3. We also flag markers where the fit is poor (slope p-value > 0.01), i.e. the allele ratio is highly variable even if its mean is near 1.

# note: lm() is in base R (package stats); no extra packages are needed here
hets <- filter(marker_info, called_geno == "HET" | is.na(called_geno))

models <- hets %>%
  filter(marker %in% colnames(genos_0.4)) %>%
  filter(is.na(a1_count) == FALSE & is.na(a2_count) == FALSE) %>%
  group_by(marker) %>%
  group_map(~ lm(a1_count ~ a2_count, data= .))

# Apply coef to each model and return a list of allele count ratios
lms <- lapply(models, coef)
ggplot() +
  geom_histogram(aes(x = sapply(lms, `[`, 2))) +
  theme_classic() +
  ggtitle("allele ratios for all NA and HET calls") +
  geom_vline(aes(xintercept = 1.5), color = "red", linetype = 2) +
  geom_vline(aes(xintercept = (2/3)), color = "red", linetype = 2) +
  xlab("allele ratio (a1/a2)") +
  geom_vline(aes(xintercept = 1), color = "black")

#list of p-values
lms_anova <- lapply(models, summary)


# collect info about each bad model
paralog_possible <- which(abs(sapply(lms,`[`,2)) > 1.5) # bad because of a positively skewed allele ratio
paralog_possible2 <- which(abs(sapply(lms,`[`,2)) < (2/3)) # bad because of a negatively skewed allele ratio

paralog_possible3 <- which(sapply(lms_anova, function(x) x$coefficients[,4][2])> 0.01) # bad because too much variance in allele ratio, even if mean ratio is 1

paralog_possible <- c(paralog_possible, paralog_possible2, paralog_possible3)

There are 5 markers with moderately poor genotyping success (i.e. less than 20% missingness, but greater than 10% missingness) in the example data. For this example pipeline we’re using some good markers to visualize this process instead.

Previously, these types of markers were filtered on an ad hoc basis, based on the number of individuals with unclear genotypes (i.e. many individuals not falling within the clear genotype-call cutoffs). We’ll attempt the same below:

Let’s look at one of the markers individually. We’ll create a plot similar to those generated by the GTseq figures script.

Below we show plots for one representative good marker and two representative bad markers with distinct issues, then discuss what to look for.

# R LOCAL

# this is an example code chunk don't run it in your own pipeline

marker_info2 <- read_csv("example_data/genotype_data/marker_info_original.csv")
marker_info2$a1_count <- as.numeric(substr(marker_info2$a1_count, 3, nchar(marker_info2$a1_count)))
marker_info2$a2_count <- as.numeric(substr(marker_info2$a2_count, 3, nchar(marker_info2$a2_count)))



# small helper so the same plot can be drawn for each marker
plot_marker <- function(d, m, title) {
  ggplot(data = d[d$marker == m, ]) +
    geom_point(aes(a1_count, a2_count, color = called_geno), alpha = 0.7, size = 2) +
    theme_classic() +
    geom_abline(aes(slope = 1, intercept = 0)) +
    geom_abline(aes(slope = 10, intercept = 0), color = "green") +
    geom_abline(aes(slope = 0.1, intercept = 0), color = "red") +
    geom_abline(aes(slope = 0.2, intercept = 0), color = "blue") +
    geom_abline(aes(slope = 5, intercept = 0), color = "blue") +
    coord_equal(ratio = 1) +
    geom_abline(slope = -1, intercept = 10) +
    xlim(c(0, 150)) + ylim(c(0, 150)) +
    ggtitle(title)
}

plot_marker(marker_info2, "OMS00008", "Good Marker")

plot_marker(marker_info2, "Omy_RAD46672-27", "Bad Marker, A2 Bias")

plot_marker(marker_info2, "Omy_RAD52458-17", "Bad Marker\nA1 bias in a subset of samples")

The good marker here demonstrates the qualities we’re looking for. Heterozygotes are distributed around a 1:1 ratio of alternative alleles, and homozygotes have few to no reads at the alternative allele.

The first bad marker shows a typical issue, reads among putative heterozygotes are biased towards the A2 allele. This is potentially due to a paralogue at this locus and we would filter this SNP out of the final dataset (or attempt to rescue it by changing the allele correction values if we are optimizing the panel for the first few runs).

The second bad marker shows a less common problem. There is bias towards the A1 allele, but only in a subset of individuals. One explanation for this observation is that there is polymorphism among the paralogs. In any case, toss this locus.

When running the pipeline yourself, use the R code chunk below. It builds similar plots for any flagged loci (missingness 10% - 20%, skewed or high-variance allele count ratios) and lets you make a decision on each one.

# R Local

plots <- marker_info %>%
  filter(marker %in% colnames(genos_0.4)) %>%
  filter(is.na(a1_count) == FALSE & is.na(a2_count) == FALSE) %>%
  group_by(marker) %>%
  do(plots = ggplot(data = .) +
       geom_point(aes(a1_count, a2_count, color = called_geno)) +
       theme_classic() +
       geom_abline(aes(slope = 1, intercept = 0)) +
       geom_abline(aes(slope = 10, intercept = 0), color = "green") +
       geom_abline(aes(slope = 0.1, intercept = 0), color = "red") +
       geom_abline(aes(slope = 0.2, intercept = 0), color = "blue") +
       geom_abline(aes(slope = 5, intercept = 0), color = "blue") +
       coord_equal(ratio = 1) +
       geom_abline(slope = -1, intercept = 10) +
       ggtitle(unique(.$marker)))

#plot all "bad markers"

#first add the high-missingness markers to the list to examine
mod_bad_plot_index <- which(plots$marker %in% miss_mod)
paralog_possible <- c(mod_bad_plot_index, paralog_possible)

# then loop through the plots by changing the index until you have looked at all your questionable markers
plots$plots[[paralog_possible[10]]] #we manually looped through these plots by changing the index for all 33 moderately bad markers; this could be made into an lapply loop in the future. Bad markers reported below

Remove the bad markers

# Local R

to_filt <- c("Omy_RAD46672-27", "Omy_RAD52458-17") # here list your bad marker names, if you have so many that this is unwieldy check out code snippet at bottom of this chunk
genos_0.5 <- genos_0.4 %>%
  dplyr::select(-one_of(to_filt))

Monomorphic Markers and Duplicates

1.0 Monomorphic Markers

To generate the 1.0 dataset, we remove monomorphic markers

genos_1.0 <- genos_0.5 %>% 
  select_if(~ length(unique(.)) > 1)
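
To illustrate what `select_if` is doing here, a toy example with invented genotype columns:

```r
library(dplyr)

# m2 is monomorphic (every individual has the same call), so it is dropped
toy <- data.frame(m1 = c("AA", "AG", "GG"),
                  m2 = c("TT", "TT", "TT"))

toy_poly <- toy %>% select_if(~ length(unique(.)) > 1)
names(toy_poly)  # "m1"
```

One subtlety: a marker whose only variation is one genotype plus missing calls ("00") has two unique values and so survives this check, even though it carries no real polymorphism.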

Duplicate Samples

Some sample tissues are provided in batches of fin clips, introducing the possibility of unknown duplicates in the sample. Let’s make sure no fin clip broke apart, leading to a single individual being represented twice in the dataset. (You may not want to run this if you are confident your samples contain no duplicates.) Rather than fussing with installing Coancestry for Windows on a unix system, we estimated relatedness using the R package related, which implements the Coancestry estimators on any operating system.

Alternatively, to run the code in Coancestry on a windows machine, use the GUI.

We used the estimator from Lynch and Ritland (1999), not the dyadic likelihood estimator of Milligan (2003).

# Local R

# The input file for coancestry needs a unique row for each individual, two columns for each diploid locus, and alleles coded as integers
# throw out metadata and the sex marker, and write to a file
just_genos <- genos_1.0[,c(2, c(9:(ncol(genos_1.0)-1)))] #note possible hardcoding here (just like missingness), if this breaks edit the columns so that it grabs only the simple sample name and genotype data
write_tsv(just_genos, "./example_data/genotype_data/just_genos.txt")

# first remove the header
# then convert all NA fields to "00"
# then convert all "-" (used for indels and complex probes) to 5
# then split the genotype values using regex in a text editor
# here's an example regex for these changes: find string: \t([ATGC05])([ATCG05])  replace string: \t\1\t\2
# then convert alleles to integers (T->1, G->2, etc.); indels / big probes (denoted with a "-") were already converted to "5"

# now run coancestry
#rmat <- coancestry("./genotype_data/just_genos.txt", dyadml = 1)
rmat2 <- coancestry("./example_data/genotype_data/just_genos.txt", lynchrd  = 1)

# save the relevant info so we don't have to run this over and over and take up a ton of diskspace
rmat_to_save <- rmat2$relatedness[rmat2$relatedness$lynchrd > 0.5,]
save(rmat_to_save, file="./example_data/genotype_data/relatedness.Rdata")
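
If you'd rather avoid the text editor, the recoding steps above can be sketched directly in R. The helper below is our own hypothetical sketch, and the integer mapping is illustrative only; any consistent allele-to-integer coding works:

```r
# map single allele characters to integers; NA genotypes become "00" (missing),
# "-" (indels / complex probes) becomes 5
recode_allele <- function(a) {
  unname(c(A = 1, T = 2, G = 3, C = 4, "-" = 5, "0" = 0)[a])
}

# split two-character genotypes into the two integer allele columns
# that coancestry expects
geno_to_cols <- function(g) {
  g[is.na(g)] <- "00"
  cbind(a1 = recode_allele(substr(g, 1, 1)),
        a2 = recode_allele(substr(g, 2, 2)))
}

geno_to_cols(c("AT", "G-", NA))
#      a1 a2
# [1,]  1  2
# [2,]  3  5
# [3,]  0  0
```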

Check for highly related individuals and remove any >= 0.95 from the dataset

# LOCAL R

#Check for relatedness
load(file = "./example_data/genotype_data/relatedness.Rdata")
#ggplot(rmat_to_save)+geom_histogram(aes(x=lynchrd))+theme_classic()
rmat_to_save[which(rmat_to_save$lynchrd >=0.95), c(1:3)]

dup_inds <- rmat_to_save[which(rmat_to_save$lynchrd >= 0.95), c(1:3)]

#if you used the coancestry GUI, you can just create a vector here manually like below
#dup_inds <- c("duplicate sample name 1", "duplicate sample name 2", etc)

genos_2.0 <- genos_1.0 %>%
  filter(!(Sample %in% dup_inds$ind2.id))

File Conversion and Stats

The final step of genotyping is to collect some statistics about the genotype dataset and to reformat the genotype file into common formats for import into other programs.

Stats

Here are some summary stats and figures from your filtered dataset

# LOCAL R

ggplot(genos_2.0)+geom_density(aes(x=`On-Target Reads`))+geom_vline(aes(xintercept=median(`On-Target Reads`)), color = "red") +theme_classic()
On Target Read Distribution

#LOCAL R
ggplot(genos_2.0)+geom_density(aes(x=`%On-Target`))+geom_vline(aes(xintercept=median(`%On-Target`)), color = "red") +theme_classic()
Proportion on Target

Depths

#LOCAL R
marker_info %>%
  filter(marker %in% colnames(genos_2.0)) %>%
  filter(ind %in% genos_2.0$Sample) %>%
  mutate(sumdepth=a1_count+a2_count) %>%
  summarise(mean=mean(sumdepth, na.rm = TRUE), median=median(sumdepth, na.rm = TRUE), sd=sd(sumdepth, na.rm = TRUE))
marker_info %>%
  filter(marker %in% colnames(genos_2.0)) %>%
  filter(ind %in% genos_2.0$Sample) %>%
  mutate(sumdepth=a1_count+a2_count) %>%
  ggplot + aes(x=sumdepth)+geom_histogram()+theme_classic()+xlab("Mean Depth Per Locus Per Individual")

marker_info %>%
  filter(marker %in% colnames(genos_2.0)) %>%
  filter(ind %in% genos_2.0$Sample) %>%
  mutate(sumdepth=a1_count+a2_count) %>%
  ggplot + aes(x=sumdepth)+geom_histogram()+theme_classic()+xlab("Mean Depth Per Locus Per Individual")+xlim(0,1000)+ggtitle("inset of plot above from 0 to 1000 reads")

Conversion

Let’s get some usable file formats

Here’s adegenet’s genind object

#LOCAL R

# Convert to genind for import into adegenet

#first get a matrix to work on

#first change column names so they don't include a dot
genos_2.1 <- genos_2.0
colnames(genos_2.1) <- gsub("\\.", "_", colnames(genos_2.1))
#convert to matrix with inds as row names
genos_2.1 <- as.matrix(genos_2.1[,c(9:(ncol(genos_2.1)-1))]) #caution potential hardcoding to exclude sex marker, get rid of the "-1" if you don't have a sex marker
row.names(genos_2.1) <- genos_2.0$sample_simple
genind_1.0 <- df2genind(genos_2.1, sep ="", ploidy=2,NA.char = "0")

#add in the populations
genos_2.2 <- genos_2.0 %>%
  left_join(meta_data, by=c("sample_simple" = "ID")) %>%
  select(-c(Sample, adapter)) %>% # don't need these anymore
  rename(sample = sample_simple) %>% #rename this column
  relocate(sample, Date, year, run) # reorder the columns, this may be different depending on the metadata columns you added


genind_1.0@pop <- as.factor(genos_2.2$run) # this might change depending on metadata; for example, to assign populations from a "pop" column in your metadata table, use as.factor(genos_2.2$pop)

Here’s a general approach using radiator package

# LOCAL R

# note: we haven't tested this yet, but check out the command:
# radiator::genomic_converter()

Finally, save your files as R objects for further analysis.

# LOCAL R

# here we save a few objects with useful info
genind_2.0 <- genind_1.0
save(genos_2.2, file ="example_data/genotype_data/genotypes_2.2.R")
save(genind_2.0, file= "example_data/genotype_data/genind_2.0.R")